Skip to content

Conversation

@lucifer4330k
Copy link

@lucifer4330k lucifer4330k commented Nov 19, 2025

Description

This PR fixes issue #691 by adding file locking to local results file writes, preventing race conditions when multiple processes write to the same results.csv file simultaneously.

Problem

When running AMLB in local mode in parallel (e.g., on a cluster or large machine with NFS), multiple processes could try to append to the same local file (session_dir/scores/results.csv) at the same time. This caused one of the append operations to be dropped, resulting in data loss.

The global results file was already protected with file locking, but the local results file was not.

Solution

  • Added file_lock() protection around board.save(append=True) in the _save() method
  • Reuses the existing global_lock_timeout configuration (default 5 seconds)
  • Added proper error handling with informative logging if lock timeout occurs
  • Includes test to verify parallel saves work correctly

Changes

  • amlb/benchmark.py: Modified _save() method to wrap local file save with file locking
  • tests/unit/amlb/test_results_race_condition.py: Added comprehensive test for parallel saves

Testing

The test simulates 10 parallel saves to the same results file and verifies:

  • All data is saved (no data loss)
  • All expected rows are present
  • All task IDs and folds are accounted for

Fixes #691

Summary by CodeRabbit

  • Bug Fixes

    • Improved reliability when multiple processes save result files concurrently by adding file-locking with graceful timeout handling.
  • Tests

    • Added unit tests validating concurrent result saves and proper behavior when file-lock timeouts occur.

GitHub Copilot and others added 2 commits November 19, 2025 13:55
…nt race conditions

- Wrap board.save(append=True) with file_lock() in _save() method
- Use the same timeout as global saves (global_lock_timeout config)
- Add comprehensive error handling for lock timeout scenarios
- Add test to verify parallel saves work correctly without data loss

This prevents data loss when multiple processes write to the same
local results.csv file simultaneously in parallel execution scenarios.
@coderabbitai
Copy link

coderabbitai bot commented Nov 19, 2025

Walkthrough

Adds a file lock around the local results CSV write to prevent concurrent-process append races; lock acquisition includes a timeout and logs exceptions on timeout while allowing the process to continue saving the global results.

Changes

Cohort / File(s) Summary
File lock integration
amlb/benchmark.py
Wraps the local results file append in a file lock with timeout handling; logs exceptions on timeout and continues normal global save flow. No public API/signature changes.
Concurrent access tests
tests/unit/amlb/test_results_race_condition.py
Adds test_parallel_save_no_race_condition() to exercise concurrent saves and verify all entries persist; adds test_save_with_file_lock_timeout() to simulate lock acquisition timeout and assert graceful handling.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Proc as Process
    participant FS as Local FS (results.csv)
    participant Lock as FileLock
    participant Global as Global Scoreboard

    Proc->>Lock: acquire(results.csv) with timeout
    alt lock acquired
        Lock->>FS: append row
        FS-->>Lock: ack
        Lock->>Proc: release
        Proc->>Global: save global board
        Global-->>Proc: ack
    else timeout
        Lock-->>Proc: TimeoutError
        Proc->>Proc: log exception (timeout)
        Proc->>Global: save global board (continue)
        Global-->>Proc: ack
    end
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Localized change affecting a single persistence point plus tests.
  • Review focus:
    • Confirm lock scope (only local append) is correct.
    • Validate timeout value and logged details provide actionable diagnostics.
    • Ensure tests reliably simulate contention and timeout behavior across platforms.

Poem

🐇 I thumped my foot and fixed the race,

Locks in paw, no scrambled trace.
Threads may hurry, files may clatter,
But scores now write — no rows will scatter.
A hop, a save, a safe, calm place.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix #691: Add file locking to local results file saves to prevent race conditions' directly and clearly identifies the main change: adding file locking to prevent race conditions in local results file saves.
Linked Issues check ✅ Passed The PR addresses all primary objectives from #691: wraps local results writes with file locking to prevent race conditions, preserves correct saving behavior with informative error handling, and includes tests verifying parallel saves work correctly.
Out of Scope Changes check ✅ Passed All changes are scoped to the stated objectives: amlb/benchmark.py adds file locking to local results writes, and the test file adds race-condition validation tests. No unrelated modifications detected.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 16bfc7e and 82ad505.

📒 Files selected for processing (1)
  • tests/unit/amlb/test_results_race_condition.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/unit/amlb/test_results_race_condition.py (1)
amlb/results.py (5)
  • Scoreboard (82-349)
  • append (312-327)
  • save (308-310)
  • all (86-87)
  • as_data_frame (189-237)
🪛 Ruff (0.14.5)
tests/unit/amlb/test_results_race_condition.py

113-113: Unused function argument: args

(ARG001)


113-113: Unused function argument: kwargs

(ARG001)


114-114: Avoid specifying long messages outside the exception class

(TRY003)


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dfe8d21 and 16bfc7e.

📒 Files selected for processing (2)
  • amlb/benchmark.py (1 hunks)
  • tests/unit/amlb/test_results_race_condition.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/unit/amlb/test_results_race_condition.py (2)
amlb/results.py (5)
  • Scoreboard (82-349)
  • append (312-327)
  • save (308-310)
  • all (86-87)
  • as_data_frame (189-237)
amlb/utils/process.py (1)
  • file_lock (35-49)
amlb/benchmark.py (3)
amlb/data.py (1)
  • path (120-121)
amlb/results.py (3)
  • path (330-349)
  • save (308-310)
  • append (312-327)
amlb/utils/process.py (1)
  • file_lock (35-49)
🪛 Ruff (0.14.5)
tests/unit/amlb/test_results_race_condition.py

115-115: Local variable original_file_lock is assigned to but never used

Remove assignment to unused variable original_file_lock

(F841)


117-117: Unused function argument: args

(ARG001)


117-117: Unused function argument: kwargs

(ARG001)


118-118: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)
amlb/benchmark.py (1)

474-489: Local results locking matches the intended fix; just confirm the timeout exception mapping

The new _save implementation correctly wraps the local board.save(append=True) in file_lock(local_path, timeout=rconfig().results.global_lock_timeout), mirroring the existing global lock behavior and addressing the race condition described in #691. Swallowing only TimeoutError and proceeding to _save_global(board) is a reasonable trade‑off: the run is still reflected in the global board, while the log message clearly calls out that local results may be incomplete.

One thing to double‑check: ensure that the filelock version you depend on still raises an exception type that subclasses TimeoutError for lock‑acquisition failures so that this except TimeoutError: block is reliably triggered. If that ever changes, you may want to catch the library’s specific timeout class instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race Condition When Writing To Local Result File

1 participant